28 research outputs found

    Efficient two-sample functional estimation and the super-oracle phenomenon

    We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural 'oracle' estimator, which is given access to the values of the unknown densities at the observations. Comment: 82 pages.

    USP: an independence test that improves on Pearson's chi-squared and the G-test.

    We present the U-statistic permutation (USP) test of independence in the context of discrete data displayed in a contingency table. Either Pearson's χ²-test of independence or the G-test is typically used for this task, but we argue that these tests have serious deficiencies, both in terms of their inability to control the size of the test, and their power properties. By contrast, the USP test is guaranteed to control the size of the test at the nominal level for all sample sizes, has no issues with small (or zero) cell counts, and is able to detect distributions that violate independence in only a minimal way. The test statistic is derived from a U-statistic estimator of a natural population measure of dependence, and we prove that this is the unique minimum variance unbiased estimator of this population quantity. The practical utility of the USP test is demonstrated on both simulated data, where its power can be dramatically greater than those of Pearson's test, the G-test and Fisher's exact test, and on real data. The USP test is implemented in the R package USP.
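
The permutation idea described in this abstract can be illustrated with a minimal sketch. It is not the USP test itself: the statistic below is a simple plug-in estimate of the squared distance between the joint cell probabilities and the product of the marginals, whereas the paper uses a U-statistic estimator whose exact form is not given in the abstract. The function names `plugin_dependence` and `permutation_pvalue` and the parameters `num_perm` and `seed` are hypothetical, chosen only for this illustration.

```python
import random
from collections import Counter


def plugin_dependence(xs, ys):
    """Plug-in estimate of sum_{i,j} (p_ij - p_i. * p_.j)^2 for paired labels.

    Assumption: this plug-in statistic stands in for the U-statistic
    estimator used by the USP test, which is not reproduced here.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # cell counts; missing cells count as 0
    row = Counter(xs)             # row margins
    col = Counter(ys)             # column margins
    total = 0.0
    for a in row:
        for b in col:
            diff = joint[(a, b)] / n - (row[a] / n) * (col[b] / n)
            total += diff * diff
    return total


def permutation_pvalue(xs, ys, num_perm=999, seed=0):
    """Permutation p-value for independence of the paired labels xs, ys.

    Shuffling ys breaks any dependence while preserving both margins.
    Counting the observed statistic among the permuted ones (the +1 terms)
    is what keeps the test's size at or below the nominal level.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    rng = random.Random(seed)
    observed = plugin_dependence(xs, ys)
    ys_perm = list(ys)
    exceed = 0
    for _ in range(num_perm):
        rng.shuffle(ys_perm)
        # Small tolerance guards against floating-point summation order.
        if plugin_dependence(xs, ys_perm) >= observed - 1e-12:
            exceed += 1
    return (1 + exceed) / (num_perm + 1)
```

Calling `permutation_pvalue(xs, ys)` on two equal-length lists of category labels returns a value in (0, 1]; rejecting when it is at most the nominal level gives a valid test for any sample size, because the permuted samples are exchangeable with the observed one under independence.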

    Efficient multivariate entropy estimation via k-nearest neighbour distances

    Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this paper, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the $k$-nearest neighbour distances of a sample of $n$ independent and identically distributed random vectors in $\mathbb{R}^d$. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient when $d \leq 3$. In addition to the new estimator proposed and theoretical understanding provided, our results facilitate the construction of asymptotically valid confidence intervals for the entropy of asymptotically minimal width.
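
As a rough illustration of the building block mentioned in this abstract, the sketch below computes the unweighted Kozachenko and Leonenko estimator from $k$-nearest neighbour distances, in the form $\frac{1}{n}\sum_i \log\bigl(\rho_{(k),i}^d V_d (n-1) / e^{\Psi(k)}\bigr)$, where $V_d$ is the volume of the unit ball and $\Psi$ is the digamma function. It does not implement the weighted average studied in the paper, since the abstract does not specify the weights. The function names are hypothetical, distances are found by brute force, and the data are assumed to have no repeated points.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Digamma at a positive integer: psi(k) = -gamma + sum_{j=1}^{k-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def log_unit_ball_volume(d):
    """Log volume of the unit Euclidean ball in R^d: pi^(d/2) / Gamma(1 + d/2)."""
    return (d / 2.0) * math.log(math.pi) - math.lgamma(1.0 + d / 2.0)


def kl_entropy(points, k=1):
    """Unweighted Kozachenko-Leonenko entropy estimate (in nats).

    points: sequence of equal-length coordinate tuples, assumed distinct.
    k: which nearest neighbour to use, with 1 <= k < len(points).
    """
    n = len(points)
    d = len(points[0])
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    log_rho_sum = 0.0
    for i, x in enumerate(points):
        # Brute-force distances from x to every other point: O(n^2) overall.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        log_rho_sum += math.log(dists[k - 1])  # k-th nearest neighbour distance
    return (d * log_rho_sum / n
            + log_unit_ball_volume(d)
            + math.log(n - 1)
            - digamma_int(k))
```

For example, `kl_entropy(sample, k=3)` on a list of 2-tuples estimates the differential entropy of the underlying density. A tree-based neighbour search would be the natural replacement for the brute-force loop on larger samples.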

    Optimal rates for independence testing via U-statistic permutation tests

    We study the problem of independence testing given independent and identically distributed pairs taking values in a $\sigma$-finite, separable measure space. Defining a natural measure of dependence $D(f)$ as the squared $L^2$-distance between a joint density $f$ and the product of its marginals, we first show that there is no valid test of independence that is uniformly consistent against alternatives of the form $\{f : D(f) \geq \rho^2\}$. We therefore restrict attention to alternatives that impose additional Sobolev-type smoothness constraints, and define a permutation test based on a basis expansion and a $U$-statistic estimator of $D(f)$ that we prove is minimax optimal in terms of its separation rates in many instances. Finally, for the case of a Fourier basis on $[0,1]^2$, we provide an approximation to the power function that offers several additional insights. Our methodology is implemented in the R package USP. Comment: 58 pages, 4 figures.

    Discussion of 'Multivariate Fisher's independence test for multivariate dependence'

    Invited discussion for Biometrika of 'Multivariate Fisher's independence test for multivariate dependence' by Gorsky and Ma (2022). Comment: 4 pages.

    Efficient functional estimation and the super-oracle phenomenon

    We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural ‘oracle’ estimator, which itself can be optimal in the related problem where the data consist of the values of the unknown densities at the observations.